The purpose of this report is to address a case study for Cyclistic, a fictional bike-share company based in Chicago. The steps of the analysis process are as follows: Ask, Prepare, Process, Analyze, Share, and Act.
To date, Cyclistic’s marketing strategy has focused on building general awareness and appealing to broad consumer segments. One factor facilitating this was the flexibility of its pricing plans, including single-ride passes, full-day passes, and annual memberships.
Customers who purchase single-ride or full-day passes are categorized as casual riders.
Customers who purchase annual memberships are recognized as Cyclistic members.
Cyclistic’s financial analysts have determined that annual members are significantly more profitable than casual riders.
The marketing director, Lily Moreno, believes that the company’s future success hinges on maximizing the conversion of casual riders to annual members. She notes that casual riders are already familiar with the Cyclistic program and have chosen it for their mobility needs.
The goal of this case study is to investigate how annual members and casual riders use Cyclistic bikes differently. Based on these insights, we will design a new marketing strategy to convert casual riders into annual members. Understanding these differences will be crucial for driving business decisions and maximizing profitability.
The case study includes the following data sets from the Google platform for academic purposes.
# Inspect the dataframes and look for incongruities
glimpse(q1_2019)
## Rows: 365,069
## Columns: 12
## $ trip_id <dbl> 21742443, 21742444, 21742445, 21742446, 21742447, 21…
## $ start_time <chr> "2019-01-01 0:04:37", "2019-01-01 0:08:13", "2019-01…
## $ end_time <chr> "2019-01-01 0:11:07", "2019-01-01 0:15:34", "2019-01…
## $ bikeid <dbl> 2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205,…
## $ tripduration <dbl> 390, 441, 829, 1783, 364, 216, 177, 100, 1727, 336, …
## $ from_station_id <dbl> 199, 44, 15, 123, 173, 98, 98, 211, 150, 268, 299, 2…
## $ from_station_name <chr> "Wabash Ave & Grand Ave", "State St & Randolph St", …
## $ to_station_id <dbl> 84, 624, 644, 176, 35, 49, 49, 142, 148, 141, 295, 4…
## $ to_station_name <chr> "Milwaukee Ave & Grand Ave", "Dearborn St & Van Bure…
## $ usertype <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
## $ gender <chr> "Male", "Female", "Female", "Male", "Male", "Female"…
## $ birthyear <dbl> 1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995…
glimpse(q1_2020)
## Rows: 426,887
## Columns: 13
## $ ride_id <chr> "EACB19130B0CDA4A", "8FED874C809DC021", "789F3C21E4…
## $ rideable_type <chr> "docked_bike", "docked_bike", "docked_bike", "docke…
## $ started_at <chr> "2020-01-21 20:06:59", "2020-01-30 14:22:39", "2020…
## $ ended_at <chr> "2020-01-21 20:14:30", "2020-01-30 14:26:22", "2020…
## $ start_station_name <chr> "Western Ave & Leland Ave", "Clark St & Montrose Av…
## $ start_station_id <dbl> 239, 234, 296, 51, 66, 212, 96, 96, 212, 38, 117, 1…
## $ end_station_name <chr> "Clark St & Leland Ave", "Southport Ave & Irving Pa…
## $ end_station_id <dbl> 326, 318, 117, 24, 212, 96, 212, 212, 96, 100, 632,…
## $ start_lat <dbl> 41.9665, 41.9616, 41.9401, 41.8846, 41.8856, 41.889…
## $ start_lng <dbl> -87.6884, -87.6660, -87.6455, -87.6319, -87.6418, -…
## $ end_lat <dbl> 41.9671, 41.9542, 41.9402, 41.8918, 41.8899, 41.884…
## $ end_lng <dbl> -87.6674, -87.6644, -87.6530, -87.6206, -87.6343, -…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
Reliable
Yes: The City of Chicago and Divvy, the official bike-sharing operator, provide the data, which supports its reliability.
Original
Yes: The datasets are the original records of Divvy bike trips, directly generated by the system’s tracking mechanisms.
Comprehensive
Mostly Yes: The datasets contain a significant amount of information about each trip, including time stamps, station locations, and user types.
Current
Partially: The “2019_Q1” and “2020_Q1” datasets are valuable historical snapshots, but the files themselves are not updated after release.
Explanation: The City of Chicago and Divvy publish new data on an ongoing basis, so the most current picture requires the most recent releases. The two datasets used here are historical.
Cited
Yes: The data is available from reputable sources: the City of Chicago Data Portal, a recognized source of public data, and Divvy’s official website.
Motivate International Inc. has made the data available under this license.
Those datasets are public data that explore how different customer types use Cyclistic bikes.
Data-privacy issues prohibit the use of riders’ personally identifiable information. This means that you won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.
In both datasets, time stamps follow the pattern “YYYY-MM-DD hh:mm:ss” (the 2019 file writes single-digit hours without a leading zero, for example “0:04:37”).
The Divvy_Trips_2019_Q1 data set includes trip duration in seconds in the tripduration field, which is absent from Divvy_Trips_2020_Q1.
User classifications changed from Customer and Subscriber in 2019 to casual and member in 2020, reflecting a shift in our understanding of users.
The identifier fields also differ in data type: trip_id and bikeid are numeric in 2019, whereas ride_id and rideable_type are character strings in 2020.
By selecting essential fields and standardizing column names, we can streamline our analysis and make the most of this valuable data!
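The schema differences listed above can also be checked programmatically. The sketch below is a minimal base-R illustration on two toy data frames: `toy_2019` and `toy_2020` are hypothetical stand-ins that reuse a few of the real column names and sample values, not the actual files. It lists the columns that appear in only one frame and the storage class of each column.

```r
# Toy stand-ins for the two quarterly files (a few real column names, tiny samples)
toy_2019 <- data.frame(
  trip_id      = c(21742443, 21742444),
  start_time   = c("2019-01-01 0:04:37", "2019-01-01 0:08:13"),
  tripduration = c(390, 441),
  usertype     = c("Subscriber", "Subscriber"),
  stringsAsFactors = FALSE
)
toy_2020 <- data.frame(
  ride_id       = c("EACB19130B0CDA4A", "8FED874C809DC021"),
  started_at    = c("2020-01-21 20:06:59", "2020-01-30 14:22:39"),
  member_casual = c("member", "member"),
  stringsAsFactors = FALSE
)

# Columns that exist in one frame but not in the other
only_2019 <- setdiff(names(toy_2019), names(toy_2020))
only_2020 <- setdiff(names(toy_2020), names(toy_2019))

# Storage class of every column, to spot type mismatches before stacking
classes_2019 <- vapply(toy_2019, function(col) class(col)[1], character(1))
classes_2020 <- vapply(toy_2020, function(col) class(col)[1], character(1))

print(only_2019)
print(only_2020)
print(classes_2019)
print(classes_2020)
```

Running the same comparison again after renaming and converting columns is a quick way to confirm that the two frames are ready to be stacked.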
We chose RStudio because it is a powerful tool that integrates visualization, reporting, and analyzing large data sets. Programming saves time and effort when interacting with data.
A copy of the CSV files was saved on the Desktop in the directory ..DESKTOP/BIKE _SHARE/ORIGINAL DATA.
The following chunks transform the 2019 data so that both data frames stack correctly when combined into a single data frame.
(q1_2019 <- rename(q1_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype
))
## # A tibble: 365,069 × 12
## ride_id started_at ended_at rideable_type tripduration start_station_id
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 21742443 2019-01-01 0:0… 2019-01… 2167 390 199
## 2 21742444 2019-01-01 0:0… 2019-01… 4386 441 44
## 3 21742445 2019-01-01 0:1… 2019-01… 1524 829 15
## 4 21742446 2019-01-01 0:1… 2019-01… 252 1783 123
## 5 21742447 2019-01-01 0:1… 2019-01… 1170 364 173
## 6 21742448 2019-01-01 0:1… 2019-01… 2437 216 98
## 7 21742449 2019-01-01 0:1… 2019-01… 2708 177 98
## 8 21742450 2019-01-01 0:1… 2019-01… 2796 100 211
## 9 21742451 2019-01-01 0:1… 2019-01… 6205 1727 150
## 10 21742452 2019-01-01 0:1… 2019-01… 3939 336 268
## # ℹ 365,059 more rows
## # ℹ 6 more variables: start_station_name <chr>, end_station_id <dbl>,
## # end_station_name <chr>, member_casual <chr>, gender <chr>, birthyear <dbl>
q1_2019 <- mutate(q1_2019, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
q1_2019 <- q1_2019 %>%
mutate(member_casual = recode(member_casual
,"Subscriber" = "member"
,"Customer" = "casual"))
# Check the unique values of the column (validating the field)
unique(q1_2019$member_casual)
## [1] "member" "casual"
# Remove fields that are not present in both years: birthyear, gender, and tripduration
# exist only in 2019; the lat/lng coordinates exist only in 2020
q1_2019 <- q1_2019 %>%
  select(-c(birthyear, gender, tripduration))
q1_2020 <- q1_2020 %>%
  select(-c(start_lat, start_lng, end_lat, end_lng))
conteo_na19 <- q1_2019 %>%
summarise(across(everything(), ~ sum(is.na(.))))
conteo_na20 <- q1_2020 %>%
summarise(across(everything(), ~ sum(is.na(.))))
options(width = 10000)# to see all fields on screen
print(conteo_na19)
## # A tibble: 1 × 9
## ride_id started_at ended_at rideable_type start_station_id start_station_name end_station_id end_station_name member_casual
## <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0 0 0
print(conteo_na20)
## # A tibble: 1 × 9
## ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id member_casual
## <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 1 1 0
options(width = 80) #Reset to default
We found two missing values in the 2020 data, one in end_station_name and one in end_station_id. We remove the affected rows from the 2020 data frame.
q1_2020 <- na.omit(q1_2020)
paste("Any NA value:",any(is.na(q1_2020)))
## [1] "Any NA value: FALSE"
The comparative analysis of bike usage between members and casual riders focused on the duration and number of trips (analyzed across different periods), peak usage times, and main routes to uncover usage patterns, relationships, and trends. These insights will enable Cyclistic to understand the differences in behavior between customer groups and tailor incentives and communication better to align their needs with the benefits of membership, ultimately aiming to drive conversions.
This section documents the cleaning tasks performed once the calculated fields have been added to the combined data frame.
# Stack individual quarters' data frames into one big data frame
all_trips <- bind_rows(q1_2019, q1_2020)
paste("Any duplicated ride_id:",any(duplicated(all_trips$ride_id)))
## [1] "Any duplicated ride_id: FALSE"
# Request seconds explicitly; with the default units = "auto", difftime() chooses the unit from the data
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")
The following code converts “ride_length” from a difftime object to numeric so we can run calculations on the data (in seconds) and screen it for bad data against several criteria.
all_trips$ride_length <- as.numeric(all_trips$ride_length)
paste("Numeric ride_length?", is.numeric(all_trips$ride_length))
## [1] "Numeric ride_length? TRUE"
paste(sum(is.na(all_trips$ride_length)), "NA values") # See how many NAs were created
## [1] "0 NA values"
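The unit handling in the ride_length calculation is easy to get wrong, so here is a minimal base-R sketch on a single pair of time stamps taken from the 2020 sample shown earlier. Parsing in UTC is an assumption made only to keep the toy example independent of the local time zone; it is not something the report specifies.

```r
# One ride from the 2020 sample: start and end time stamps as character strings
started_at <- "2020-01-21 20:06:59"
ended_at   <- "2020-01-21 20:14:30"

# Parse explicitly (UTC is an assumption made for reproducibility of this sketch)
start_ts <- as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
end_ts   <- as.POSIXct(ended_at,   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# With the default units = "auto", difftime() picks the unit from the data
auto_diff <- difftime(end_ts, start_ts)

# Requesting seconds makes the unit explicit and stable
secs_diff <- difftime(end_ts, start_ts, units = "secs")
ride_length_secs <- as.numeric(secs_diff)

print(units(auto_diff))   # unit chosen automatically for this single ride
print(ride_length_secs)   # duration in seconds
```

For this one ride the automatic unit is minutes, while the explicit call returns 451 seconds. On a full data set the automatic choice depends on the smallest absolute difference, which is why naming the unit is the safer habit.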
paste( sum(all_trips$member_casual == "casual" & all_trips$ride_length > 14400), "huge casual's rides")
## [1] "1044 huge casual's rides"
paste(sum(all_trips$member_casual == "member" & all_trips$ride_length > 3600), " huge member's rides")
## [1] "2366 huge member's rides"
paste(sum(all_trips$ride_length < 0), " negatives ride durations")
## [1] "116 negatives ride durations"
paste(sum(all_trips$start_station_name == "HQ QR"), "taken out of docks and checked for quality by Divvy.")
## [1] "3766 taken out of docks and checked for quality by Divvy."
“HQ QR” does not appear to be a typical public-facing docking station where regular users begin or end their trips. Instead, it looks like a designated point associated with Divvy’s operational activities, such as bike maintenance or redistribution, so we exclude trips that start there.
The data frame includes 116 negative ride_length entries, a tiny proportion that we will remove because they are clear data errors.
We return to this cleaning step to bound the scope of the analysis, because we found very long rides that skew the statistics and are not representative given Divvy’s pricing terms:
Day Pass Terms: Divvy’s Day Pass typically includes unlimited rides of up to 3 hours each within 24 hours. Rides exceeding this duration incur additional per-minute charges.
Annual Membership Terms: Divvy annual memberships usually include the first 45 minutes of each ride for free. Rides longer than 45 minutes incur additional per-minute charges.
To address these issues, we set the upper limit of ride length to 14,400 seconds (4 hours) for casual riders and 3,600 seconds (1 hour) for members. Under these criteria, we remove 1.5% of the casual observations and 0.3% of the member observations.
The setting of upper limits is reasonable because it improves the representativeness of the analysis by focusing on typical usage patterns for casual and member riders while removing a tiny percentage of potentially anomalous or outlier data.
The chunk below removes data based on the criteria above, creating a new version (v2) of a data frame.
# https://www.datasciencemadesimple.com/delete-or-drop-rows-in-r-with-conditions-2/
# Drop HQ QR trips, negative durations, rides over 14,400 s, and member rides over 3,600 s
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length < 0 | all_trips$ride_length > 14400 | (all_trips$member_casual == "member" & all_trips$ride_length > 3600)), ]
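To make the exclusion rules easier to audit, the following sketch applies the same criteria to a toy data frame with one flag per rule. The rows in `toy_trips` and the station names "A", "B", and "C" are hypothetical. The `caps` lookup vector is one possible way to express per-group limits and is not the report's own code.

```r
# Toy trips (hypothetical values); ride_length is in seconds
toy_trips <- data.frame(
  member_casual      = c("casual", "casual", "member", "member", "member", "casual"),
  start_station_name = c("A", "A", "B", "HQ QR", "B", "C"),
  ride_length        = c(1200, 15000, 3700, 500, 600, -30),
  stringsAsFactors = FALSE
)

# Upper limits used in the report: 14,400 s for casual riders, 3,600 s for members
caps <- c(casual = 14400, member = 3600)

# Flag each exclusion rule separately so every removal can be counted
is_hq       <- toy_trips$start_station_name == "HQ QR"
is_negative <- toy_trips$ride_length < 0
over_cap    <- toy_trips$ride_length > unname(caps[toy_trips$member_casual])

# Keep only the rows that trigger none of the rules
toy_trips_v2 <- toy_trips[!(is_hq | is_negative | over_cap), ]

print(c(hq = sum(is_hq), negative = sum(is_negative), over_cap = sum(over_cap)))
print(toy_trips_v2)
```

Because no member ride under 3,600 seconds can exceed 14,400 seconds, looking up the limit by user type selects the same rows as the combined condition used above.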
In the following chunk, we add date fields for different periods to the data set all_trips_v2 and save it in the directory ..DESKTOP/BIKE _SHARE/ORIGINAL DATA in case it is needed for further analysis.
all_trips_v2$date <- as.Date(all_trips_v2$started_at) #The default format is yyyy-mm-dd
all_trips_v2$month <- format(all_trips_v2$date, "%m")
all_trips_v2$year <- format(all_trips_v2$date, "%Y")
all_trips_v2$day_of_week <- format(all_trips_v2$date, "%A")
write_csv(all_trips_v2,"all_trips_v2.csv")
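One detail worth illustrating in the date fields is that `format(date, "%A")` returns weekday names in the language of the current locale, which matters later when the names are matched against English factor levels. The sketch below uses two time stamps from the 2020 sample and shows a locale-independent alternative based on the weekday index. The `english_days` lookup is a hypothetical helper introduced for this illustration only.

```r
# Two start times from the 2020 sample
started_at <- c("2020-01-21 20:06:59", "2020-01-30 14:22:39")

# as.Date() keeps only the calendar date
ride_date <- as.Date(started_at)

month_num <- format(ride_date, "%m")   # zero-padded month as text
year_num  <- format(ride_date, "%Y")   # four-digit year as text
day_name  <- format(ride_date, "%A")   # weekday name, depends on the session locale

# Locale-independent alternative: weekday index (0 = Sunday) plus an English lookup
day_index    <- as.POSIXlt(ride_date)$wday
english_days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                  "Thursday", "Friday", "Saturday")
day_label    <- english_days[day_index + 1]

print(data.frame(ride_date, month_num, year_num, day_label))
```

If the session locale is not English, `day_name` and `day_label` will differ, and ordering the weekday factor by English names would silently produce missing values.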
The initial phase of the analysis involved a comprehensive descriptive exploration of the all_trips_v2 data set to establish a foundational understanding of bike usage patterns. This included examining key variables such as trip duration (ride_length), trip frequency, and temporal aspects like monthly, weekly, and daily usage trends. We also investigated the distribution of trip start times to identify peak usage periods and analyzed the prevalence of different routes. Statistical summaries, including central tendency and dispersion measures, were calculated for ride_length to understand its overall distribution and compare it between member and casual riders. This descriptive overview provides the context for the subsequent visual exploration and identification of key findings regarding user behavior and usage patterns.
Regarding ridership patterns, the analysis revealed that member bike usage is significantly more stable and consistently higher in total volume than that of casual riders. The data suggests distinct usage behaviors: members take frequent short trips, while casual riders take far fewer but longer trips. Notably, casual riders mostly ride on weekends and in the afternoon, whereas members primarily ride on weekdays and during peak hours. Members also account for the most popular routes, while casual riders concentrate on a few specific starting and ending stations.
Concerning relationships, a key and somewhat surprising finding emerged: while the total volume of trips consistently shows significantly higher usage by members, the statistics analyzed within each period (monthly, weekly, daily) indicate the opposite for the length of individual rides. This apparent contradiction is driven by the markedly longer trip duration of casual riders and their substantially lower number of trips compared to members. This interplay between trip length and frequency directly influences the aggregate statistics, creating this noteworthy dynamic in the data.
Finally, regarding trends, the analysis identified an increase in casual rider bike usage during warmer months, a trend that extended to all users by 2020. Furthermore, the casual user segment is growing faster than members, indicating the current strategy’s effectiveness in broad user acquisition.
These key findings provide a solid foundation for understanding how members and casual riders utilize the bike-share service. This can inform the development of more effective strategies for casual rider conversion. The following section presents detailed visualizations that support these findings and delve deeper into the specific behaviors of each group.
total_usage <- all_trips_v2 %>% group_by(member_casual) %>% summarize(Total_Ride_Length = sum(ride_length), Number_of_Rides = n() )
print(total_usage)
## # A tibble: 2 × 3
## member_casual Total_Ride_Length Number_of_Rides
## <chr> <dbl> <int>
## 1 casual 128166358 66833
## 2 member 464221012 717946
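The relationship between total volume and per-ride duration can be worked out directly from the totals printed above. The following base-R sketch copies those four numbers by hand and derives the mean ride length and each group's share of rides and of riding time. The column names `Mean_Ride_Seconds`, `Share_of_Rides`, and `Share_of_Time` are introduced here for illustration.

```r
# Totals copied from the total_usage output above (ride length in seconds)
usage <- data.frame(
  member_casual     = c("casual", "member"),
  Total_Ride_Length = c(128166358, 464221012),
  Number_of_Rides   = c(66833, 717946),
  stringsAsFactors = FALSE
)

# Mean seconds per ride: total time divided by number of rides
usage$Mean_Ride_Seconds <- usage$Total_Ride_Length / usage$Number_of_Rides

# Each group's share of all rides and of all riding time
usage$Share_of_Rides <- usage$Number_of_Rides / sum(usage$Number_of_Rides)
usage$Share_of_Time  <- usage$Total_Ride_Length / sum(usage$Total_Ride_Length)

print(usage)
```

Members account for roughly nine in ten rides, yet the mean casual ride is about three times as long as the mean member ride, so casual riders contribute a little over a fifth of total riding time.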
The statistical comparison is based on basic statistics (Minimum, Maximum, Quartiles, Median, and Mean) of ride duration. It helps uncover the distribution of ride duration by user type.
# Apply the summary function to the ride_length for each group
summary_by_user <- all_trips_v2 %>%
group_by(member_casual)%>%
summarise(summary_ride_length = list(summary(ride_length))) %>%
unnest_wider(summary_ride_length)
options(digits = 10)
print(summary_by_user)
## # A tibble: 2 × 7
## member_casual Min. `1st Qu.` Median Mean `3rd Qu.` Max.
## <chr> <table[1d]> <table[1d]> <table[1d]> <table[1d]> <table[1> <tab>
## 1 casual 2 772 1373 1917.7106818 2276 14385
## 2 member 1 317 506 646.5960003 819 3600
KEY TAKEAWAYS
Ride Duration Difference: Even after setting upper limits, casual riders consistently exhibit longer ride duration than members across all statistical measures (median, mean, quartiles).
Typical Usage Patterns: Members predominantly use the bikes for shorter trips (median around 8.4 minutes), likely for commuting or quick errands. Casual riders have a much longer typical ride duration (median around 22.9 minutes), suggesting more leisure-oriented or longer single-use trips.
Distribution Shape: The ride duration distribution for casual riders remains more right-skewed than for members, indicating a greater tendency for longer rides within their allowed window.
Impact of Upper Limits: The upper limits are now reflected in the maximum values, effectively removing the extreme outliers that were present before. This provides a more focused view on most ride durations within a reasonable time frame for each user type.
In the following chunks, we analyze bike usage by month and by day of the week, using the median as the measure of central tendency because it is less sensitive than the mean to the few very long rides that remain.
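As a small illustration of why the median is the steadier summary here, the sketch below compares mean and median on a handful of hypothetical ride lengths before and after adding a single very long ride. All values are invented for the example.

```r
# Hypothetical ride lengths in seconds
typical_rides <- c(300, 400, 500, 600, 700)

# The same rides plus one very long ride just under the casual upper limit
with_long_ride <- c(typical_rides, 14000)

mean_before   <- mean(typical_rides)
median_before <- median(typical_rides)
mean_after    <- mean(with_long_ride)
median_after  <- median(with_long_ride)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

One extra ride moves the mean from 500 to 2,750 seconds, while the median only shifts from 500 to 550 seconds.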
In this section, we show two bar graphs to compare the monthly bike usage of members and casual riders—one with the ride_length behavior, and the other with the number of rides.
#Let's analyze monthly bike usage per user
monthly_usage <-all_trips_v2 %>% group_by(month,member_casual) %>%
summarise( ride_length_median = median(ride_length), .groups = "drop")
plot_length <- ggplot(monthly_usage, aes(x = month, y =ride_length_median, fill = member_casual )) + geom_col(position = "dodge")
plot_ride_count <- ggplot(all_trips_v2, aes(x = month, fill = member_casual )) + geom_bar(position = "dodge")+ scale_y_continuous(labels = label_number(accuracy = 1))
combined_plot <- plot_length + plot_ride_count +
plot_layout(ncol = 2) + # Specifies that plots are placed in two columns
plot_annotation(title = "Monthly Comparison of Ride Length and Number of Trips by User")
print(combined_plot, width = 12, height = 6)
KEY TAKEAWAYS
Different Usage Patterns: Members use the bike-sharing service much more frequently, suggesting a potential use for commuting or short, regular trips. Casual riders use the service less often but for significantly longer durations, indicating more leisure-oriented or longer, single-use trips.
Temporal Trends: Both the median ride length and the number of rides show variation across the months, suggesting seasonal or monthly patterns in usage. The increase in casual ridership towards March could be due to improving weather.
Dominance of Members: Members constitute most of the rides taken during these first three months.
# See the median ride time by each day for members vs casual users
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
all_trips_v2 %>%
group_by(day_of_week,member_casual) %>%
summarise(median_ride_length = median(ride_length), .groups = "drop") %>% ggplot(aes(x=day_of_week, y = median_ride_length, fill = member_casual)) +
geom_col(position = "dodge")+
labs(title = "Median Ride Length by Day of Week and User Type")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
KEY TAKEAWAYS
The median reflects a clear pattern in bike usage across the week.
Casual riders show a significantly longer median ride length on every day of the week.
Notably, weekends are the peak days for casual riders, highlighting their preference for biking during this time.
In comparison, members showcase a more consistent and stable pattern of bike usage, emphasizing the reliability of their riding habits.
data <- all_trips_v2 %>%
group_by(member_casual, date, year) %>%
summarise(daily_avg = mean(ride_length), .groups = "drop") %>%
mutate(year = as.factor(year)) # Convert year to factor
contrasting_colors <- c("member" = "#1f78b4", "casual" = "#e31a1c")
plots <- lapply(unique(data$year), function(y) {
p <- plot_ly(data %>% filter(year == y),
x = ~date,
y = ~daily_avg,
color = ~member_casual,
colors = contrasting_colors,
text = ~paste("Date: ", date, "<br>Mean: ", round(daily_avg, 2)),
hoverinfo = "text") %>%
add_trace(type = "scatter", mode = "lines", name = ~member_casual, showlegend = (y == unique(data$year)[1])) %>%
add_trace(type = "scatter", mode = "markers", name = ~member_casual, showlegend = FALSE) %>%
layout(
annotations = list(
x = 0.75,
y = 0.95, # Adjusted y value (lower than 1.05)
text = paste("Year:", y),
xref = "paper",
yref = "paper",
showarrow = FALSE
),
xaxis = list(title = list(text = ""), tickformat = "%b"),
yaxis = list(title = list(text = ""))
)
return(p)
})
subplot(plots, nrows = 1, shareX = TRUE, titleX = TRUE) %>%
layout(title = list(text = "Average Ride Length by Date and User Type"),
width = 800, # Adjust this value
height = 400) # Adjust this value
KEY TAKEAWAYS
Again, members’ average daily ride lengths are more consistent and stable than those of casual riders. The average ride length for casual riders is more volatile and often reaches higher peaks than for members.
The average daily ride length is significantly higher for casual riders than for members.
Casual riders show seasonal variation in ride lengths, with greater fluctuations in early spring (March) than in winter (January-February). Notably, average ride lengths increased in the spring of 2020 compared to 2019, potentially linked to broader societal changes.
The daily fluctuations suggest that factors such as weather, day of the week, and events likely influence average ride duration.
This section will investigate the annual behavior of the number of trips.
trips_by_user <-all_trips_v2 %>%
group_by(member_casual, year) %>%
summarise(number_of_rides = n(), .groups = "drop")
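# Rows are ordered casual 2019, casual 2020, member 2019, member 2020, so
# crecimiento_19 is the growth rate for casual riders and crecimiento_20 the growth rate for members
crecimiento_19 <- (trips_by_user$number_of_rides[2] - trips_by_user$number_of_rides[1]) / trips_by_user$number_of_rides[1]
crecimiento_20 <- (trips_by_user$number_of_rides[4] - trips_by_user$number_of_rides[3]) / trips_by_user$number_of_rides[3]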
ggplot(trips_by_user,aes(x = member_casual, y = number_of_rides, fill = as.factor(year))) +
geom_col(position = "dodge") +
geom_text(aes(label = format(number_of_rides, big.mark = ",", trim = TRUE, scientific = FALSE),#eliminate scientific notation on y-axis
group = as.factor(year)),
position = position_dodge(width = 0.9),
vjust = -0.3,
size = 4) +
labs(title = "Number of Rides ( Annual Comparison )",
y = NULL, # Remove the y-axis label
fill = "Year") +
theme(axis.title.y = element_blank(), #To Eliminate y-axis title
axis.text.y = element_blank(), # To Eliminate y-axis labels
axis.ticks.y = element_blank())+# To Eliminate y-axis ticks
annotate("text", x = "casual", y = 80000,
label = paste0("\u2191 (", round(crecimiento_19 * 100, 2), "%)"),
size = 4, color = "DarkGreen")+
annotate("text", x = 1 + 0.5, y = 370000,
label = paste0("\u2191 (", round(crecimiento_20 * 100, 2), "%)"),
size = 4, color = "DarkGreen")
KEY TAKEAWAYS
Rapid Growth of Casual Users: A 93% increase in trips by casual users means the number of trips this group takes nearly doubled from one year to the next. This significant growth suggests a much greater adoption or use of the service by non-subscribed users.
Modest but Sustained Growth of Members: A 10.6% increase in member trips is also positive, indicating that the subscribed user base is growing or using the service more frequently. Although the percentage is lower than that of casual users, it still represents a significant increase given the already large member base.
Different Growth Dynamics: The significant disparity in growth percentages suggests that the factors driving service usage might affect casual users and members differently. There could be marketing campaigns or external factors that are especially attracting non-subscribed users, or perhaps members’ usage patterns are more stable.
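The growth percentages are year-over-year relative changes, (current - previous) / previous. The sketch below computes them per user type by selecting rows by group and year rather than by row position, which avoids depending on the order of the summary table. The counts in `toy_counts` are hypothetical round numbers chosen to reproduce growth rates of the same size, not the report's actual yearly totals. `growth_rate` is a helper name introduced for this example.

```r
# Relative change from one period to the next
growth_rate <- function(previous, current) {
  (current - previous) / previous
}

# Hypothetical yearly ride counts per user type
toy_counts <- data.frame(
  member_casual   = c("casual", "casual", "member", "member"),
  year            = c(2019, 2020, 2019, 2020),
  number_of_rides = c(1000, 1930, 20000, 22120),
  stringsAsFactors = FALSE
)

# Compute growth within each user type, ordering by year inside the group
growth_by_user <- sapply(split(toy_counts, toy_counts$member_casual), function(group) {
  group <- group[order(group$year), ]
  growth_rate(group$number_of_rides[1], group$number_of_rides[2])
})

print(round(growth_by_user * 100, 1))
```

With these toy counts the result is 93% for casual riders and 10.6% for members. A large percentage on a small base can still represent fewer additional rides than a small percentage on a large base.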
In the following chunk, we convert the character string started_at to a standard date-time stamp to make operations easier, in this case extracting the hour when each ride begins as a new field.
# Convert to POSIXct to keep both date and time
all_trips_v2$start_at_datetime <- as.POSIXct(all_trips_v2$started_at, format = "%Y-%m-%d %H:%M:%S")
# Extract the hour from 'start_at_datetime'
all_trips_v2 <- all_trips_v2 %>%
mutate(hour_started = hour(start_at_datetime))
#Convert 'hour_started' to a factor for proper binning in ggplot
all_trips_v2$hour_started <- factor(all_trips_v2$hour_started, levels = 0:23)
# Create a bar plot to visualize the distribution by hour
ggplot(all_trips_v2, aes(x = hour_started, fill = member_casual)) +
geom_bar(position = "stack") +
labs(title = "Bike Trip Distribution by Hour and User Type",
x = "Hour of Day",
y = "Trip Frequency",
fill = "User Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels
KEY TAKEAWAYS
Members’ maximum bike usage is during peak hours (7:00 am - 9:00 am and 4:00 pm - 6:00 pm).
Casual riders’ maximum bike usage is during the afternoon, essentially from 12:00 pm to 6:00 pm.
Note that the number of rides is significantly higher for members throughout the day.
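The hourly binning behind these observations can be sketched in base R without extra packages. The example below uses two start times from the 2020 sample and parses them in UTC (an assumption made only so the toy example does not depend on the local time zone). It shows why the hour is stored as a factor with levels 0 to 23: hours with no rides still appear in the counts.

```r
# Two start times from the 2020 sample
started_at <- c("2020-01-21 20:06:59", "2020-01-30 14:22:39")

# Parse the character strings into date-times (UTC assumed for this sketch)
start_ts <- as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Extract the hour of day as an integer
hour_started <- as.integer(format(start_ts, "%H"))

# A factor with all 24 levels keeps empty hours visible in tables and plots
hour_factor    <- factor(hour_started, levels = 0:23)
trips_per_hour <- table(hour_factor)

print(hour_started)
print(trips_per_hour)
```

Without the explicit levels, the table would contain only the two hours that occur in the data, and a bar chart would silently skip the rest of the day.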
In this section, we will show the top ten routes with the most frequent trips to uncover the stations most used by different user types.
The field route will be added as a concatenation of the start and end station names.
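As a minimal illustration of the idea, the base-R sketch below builds a route label from start and end station names and counts trips per route. The station names are real ones from this report, but the six toy trips and their counts are hypothetical.

```r
# Hypothetical trips between stations named in this report
toy_routes <- data.frame(
  start_station_name = c("Lake Shore Dr & Monroe St", "Lake Shore Dr & Monroe St",
                         "Lake Shore Dr & Monroe St", "Streeter Dr & Grand Ave",
                         "Streeter Dr & Grand Ave", "Michigan Ave & Washington Blvd"),
  end_station_name   = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                         "Streeter Dr & Grand Ave", "Lake Shore Dr & Monroe St",
                         "Lake Shore Dr & Monroe St", "Streeter Dr & Grand Ave"),
  stringsAsFactors = FALSE
)

# A route is the start and end station joined into one label (direction matters)
toy_routes$route <- paste(toy_routes$start_station_name,
                          toy_routes$end_station_name, sep = "\n")

# Count trips per route and sort from most to least frequent
route_counts <- sort(table(toy_routes$route), decreasing = TRUE)

# Ordering factor levels by frequency is what keeps bars sorted in a plot
toy_routes$route <- factor(toy_routes$route, levels = names(route_counts))

print(as.vector(route_counts))
print(levels(toy_routes$route)[1])
```

Because the label keeps the direction of travel, a trip from one station to another and the return trip are counted as two different routes.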
popular_routes <- all_trips_v2 %>%
mutate(route = paste(start_station_name, "\n", end_station_name, sep = "")) %>%
group_by(route, member_casual) %>%
summarise(trip_count = n(), .groups = "drop") %>%
arrange(desc(trip_count))
top_10_routes <- head(popular_routes,10)
# ggplot orders the x-axis alphabetically by default. To show the columns in descending order of trip_count, convert route to a factor whose levels are ordered by descending trip_count
top_10_routes <- top_10_routes %>%
mutate(route = factor(route, levels = route[order(-trip_count)]))
ggplot(top_10_routes, aes(x = route, y= trip_count, fill = member_casual)) +
geom_bar(position = "dodge" , stat = "identity") +
labs(title = "Trip Distribution by User Type for Top 10 Most Popular Routes",
x = "Route (Start to End Station)",
y = "Number of Trips",
fill = "User Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x-axis labels
KEY TAKEAWAYS
Members use the most popular routes (routes with the highest number of trips).
Note that Lake Shore Dr. & Monroe St. is the station most used by casual riders, and Michigan Ave & Washington Blvd is the station most used by members.
It is useful to identify the routes most used by casual riders, because these reveal the key stations where casual riders can be reached with offers to convert them to members.
top_10_routes_casual <- popular_routes %>% filter(member_casual == "casual") %>% head(10)
ggplot(top_10_routes_casual, aes(x = route, y= trip_count)) +
geom_bar(position = "dodge",stat = "identity") +
labs(title = "Trip Distribution by Casual Riders - Top 10 Most Popular Routes",
x = "Route (Start to End Station)",
y = "Number of Trips" ) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
KEY TAKEAWAYS
The key starting point for casual riders is Lake Shore Dr. & Monroe St.
The key ending point for casual riders is Streeter Dr. & Grand Ave.
Members demonstrate more consistent daily ride lengths, while casual riders exhibit greater volatility, with daily fluctuations suggesting that factors like day of the week and weather influence their usage more strongly.
Members’ total bike usage is significantly higher than that of casual riders.
The following patterns suggest that casual riders use bikes for leisure or tourism, while members rely on bikes for commuting or transportation.
Casual riders show a general seasonal pattern and a specific increase in spring 2020.
Members use the most popular routes, which are the most efficient and well-traveled.
The most used stations by casual riders are:
Lake Shore Dr. & Monroe St as the starting point
Streeter Dr. & Grand Ave as the ending point
Casual riders tend to increase bike usage in warmer months, and by 2020 this tendency extended to all users, mainly in March.
The segment of casual users is experiencing much faster growth compared to members, although both groups show a positive trend in using the service. This growth demonstrates the effectiveness of the current strategy of building general awareness and appealing to broad consumer segments who don’t purchase the membership. Analyzing the reasons behind these different growth rates and the membership pricing terms can help optimize the business strategy.
Offer a “Leisure to Loyalty” Membership Trial: Capitalize on the casual riders’ tendency to use bikes for longer trips on weekends and afternoons. Offer a limited-time membership trial targeting casual users during these peak leisure times. This trial could provide benefits like discounted rates for longer durations, free weekend rentals after a certain number of casual rides, or access to member-only routes or curated leisure ride suggestions. Promote this trial through the app when casual riders start longer trips or on weekend afternoons.
Implement a Tiered Membership System with Leisure-Focused Benefits: Introduce a membership tier that caters specifically to the leisure use patterns of casual riders. This could be a more affordable option with benefits like extended rental times on weekends, discounts at partner leisure destinations (cafes near popular routes, parks), or the ability to reserve bikes in advance for weekend outings. Clearly communicate the value proposition of this tier compared to the standard membership, highlighting the cost savings for their typical usage.
Personalized Conversion Campaigns Based on Usage Patterns: Leverage the data on casual riders’ preferred starting and ending points (e.g., Lake Shore Dr. & Monroe St., Streeter Dr. & Grand Ave) and their tendency to ride more in warmer months. Implement personalized in-app messages or email campaigns during these times, highlighting membership benefits for their specific leisure routes and frequency. For example, suggest cost savings if they were members based on their past usage, or offer discounts on annual memberships as the weather gets warmer and their riding increases.